{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "source": [
        "!apt-get install poppler-utils tesseract-ocr"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "pI-qkRY5r7VT",
        "outputId": "29a90889-5be4-4947-edee-aad61298fdbe"
      },
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Reading package lists... Done\n",
            "Building dependency tree... Done\n",
            "Reading state information... Done\n",
            "poppler-utils is already the newest version (22.02.0-2ubuntu0.5).\n",
            "The following additional packages will be installed:\n",
            "  tesseract-ocr-eng tesseract-ocr-osd\n",
            "The following NEW packages will be installed:\n",
            "  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd\n",
            "0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.\n",
            "Need to get 4,816 kB of archives.\n",
            "After this operation, 15.6 MB of additional disk space will be used.\n",
            "Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]\n",
            "Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]\n",
            "Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]\n",
            "Fetched 4,816 kB in 0s (27.0 MB/s)\n",
            "Selecting previously unselected package tesseract-ocr-eng.\n",
            "(Reading database ... 123635 files and directories currently installed.)\n",
            "Preparing to unpack .../tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1.1_all.deb ...\n",
            "Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...\n",
            "Selecting previously unselected package tesseract-ocr-osd.\n",
            "Preparing to unpack .../tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1.1_all.deb ...\n",
            "Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...\n",
            "Selecting previously unselected package tesseract-ocr.\n",
            "Preparing to unpack .../tesseract-ocr_4.1.1-2.1build1_amd64.deb ...\n",
            "Unpacking tesseract-ocr (4.1.1-2.1build1) ...\n",
            "Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1.1) ...\n",
            "Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1.1) ...\n",
            "Setting up tesseract-ocr (4.1.1-2.1build1) ...\n",
            "Processing triggers for man-db (2.10.2-1) ...\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "id": "0pK6JsOmUttl"
      },
      "outputs": [],
      "source": [
        "!pip install -qU \\\n",
        "    \"unstructured[pdf]==0.15.13\" \\\n",
        "    nltk==3.9.1"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import nltk\n",
        "\n",
        "nltk.__version__  # confirm you see 3.9.1, otherwise restart session"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "gRT8q2yL0K5E",
        "outputId": "7bf05506-52c9-4dc2-9e7b-ff978dbc6c40"
      },
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'3.9.1'"
            ],
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            }
          },
          "metadata": {},
          "execution_count": 1
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "We need to download NLTK `punkt` tokenizer otherwise we will see:\n",
        "\n",
        "```\n",
        "Resource punkt not found. Please use the NLTK Downloader to obtain the resource\n",
        "```"
      ],
      "metadata": {
        "id": "1jJjp1RkYejq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "nltk.download(\"punkt\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "QBcSbBgNxwR6",
        "outputId": "950a94a3-c76f-4e1c-b5e7-1d9d230b4511"
      },
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "from unstructured.partition.auto import partition\n",
        "\n",
        "article_url = \"https://arxiv.org/pdf/1706.03762.pdf\"\n",
        "title = \"Attention is All You Need\"\n",
        "elements = partition(\n",
        "    url=article_url,\n",
        "    strategy=\"fast\",\n",
        "    skip_infer_table_types=[],\n",
        ")"
      ],
      "metadata": {
        "id": "Yq36GAEv6K6R"
      },
      "execution_count": 52,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "This outputs everything into unstructured elements for us:"
      ],
      "metadata": {
        "id": "YP2iOxfU2G50"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "elements[:10]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CWOuVTpsrq4P",
        "outputId": "4a1eb89c-2f52-4465-da4a-e83192165527"
      },
      "execution_count": 40,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[<unstructured.documents.elements.Text at 0x7a3648901720>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a3648901570>,\n",
              " <unstructured.documents.elements.Title at 0x7a3648968850>,\n",
              " <unstructured.documents.elements.Text at 0x7a3648968a60>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a3648968700>,\n",
              " <unstructured.documents.elements.Title at 0x7a3647f08d00>,\n",
              " <unstructured.documents.elements.Title at 0x7a3647f093c0>,\n",
              " <unstructured.documents.elements.Title at 0x7a3647f09fc0>,\n",
              " <unstructured.documents.elements.Title at 0x7a3647f0a410>,\n",
              " <unstructured.documents.elements.Title at 0x7a3647f0ae60>]"
            ]
          },
          "metadata": {},
          "execution_count": 40
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "These are not perfect however, ideally you want to be post-processing these elements to merge and/or split."
      ],
      "metadata": {
        "id": "PGhYm6Jy8TYj"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for elem in elements[:6]:\n",
        "    print(elem.text)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "cXEGG8Yu8Iyq",
        "outputId": "3a9d6fe6-53c2-4182-f9b5-4ed089200699"
      },
      "execution_count": 46,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "3 2 0 2\n",
            "g u A 2\n",
            "] L C . s c [\n",
            "7 v 2 6 7 3 0 . 6 0 7 1 : v i X r a\n",
            "Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.\n",
            "Attention Is All You Need\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Using this we can begin to handle our data in a way that is context and structure aware. For example we may prefix headers to chunks. Let's try this section where we have `Title` elements followed by `NarrativeText` elements:"
      ],
      "metadata": {
        "id": "n5VaSIxE2Rbi"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "elements[30:45]"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "T2DHEBKH83gq",
        "outputId": "10f6f886-6bd1-402a-e51a-6e83a1d0d91c"
      },
      "execution_count": 49,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[<unstructured.documents.elements.Title at 0x7a3648986230>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a3648985690>,\n",
              " <unstructured.documents.elements.Footer at 0x7a36489869b0>,\n",
              " <unstructured.documents.elements.Title at 0x7a3647cfeaa0>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a3647cfea40>,\n",
              " <unstructured.documents.elements.Title at 0x7a3647cfe590>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a364cf14400>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a3647cffdc0>,\n",
              " <unstructured.documents.elements.Title at 0x7a3647cffd90>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a3647cfc760>,\n",
              " <unstructured.documents.elements.Footer at 0x7a3647cfefe0>,\n",
              " <unstructured.documents.elements.Title at 0x7a36488846a0>,\n",
              " <unstructured.documents.elements.Title at 0x7a3648885600>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a3648885900>,\n",
              " <unstructured.documents.elements.NarrativeText at 0x7a3648885d80>]"
            ]
          },
          "metadata": {},
          "execution_count": 49
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "for elem in elements[30:45]:\n",
        "    print(elem.text)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "X3Fs3rD88t9F",
        "outputId": "f6acccd6-808a-4e22-f5e8-9d77b22bca0c"
      },
      "execution_count": 48,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "3 Model Architecture\n",
            "Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.\n",
            "2\n",
            "Figure 1: The Transformer - model architecture.\n",
            "The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.\n",
            "3.1 Encoder and Decoder Stacks\n",
            "Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position- wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.\n",
            "Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.\n",
            "3.2 Attention\n",
            "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum\n",
            "3\n",
            "Scaled Dot-Product Attention\n",
            "Multi-Head Attention\n",
            "Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.\n",
            "of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "We will merge these blocks to create more context-aware chunks:"
      ],
      "metadata": {
        "id": "eOcZkAjc88E7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from unstructured.documents.elements import Title, NarrativeText\n",
        "\n",
        "header = \"\"\n",
        "chunks = []\n",
        "for elem in elements[30:45]:\n",
        "    if isinstance(elem, Title):\n",
        "        header = elem.text\n",
        "    elif isinstance(elem, NarrativeText):\n",
        "        chunks.append(f\"Document: {title}\\nHeader: {header}\\nContent: {elem.text}\")"
      ],
      "metadata": {
        "id": "eObeBj6L9C4S"
      },
      "execution_count": 55,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "print(chunks[0])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xOi1GL7s9E3S",
        "outputId": "f135b446-0ef1-44a4-b54a-6d8d5427e2a8"
      },
      "execution_count": 56,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Document: Attention is All You Need\n",
            "Header: 3 Model Architecture\n",
            "Content: Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Realistically we would want to be _chunking_ our content _before_ augmenting those chunks with additional information, but we can see from this how we can provide more context by simply considering the structure of data."
      ],
      "metadata": {
        "id": "rD0nQ7Rg9VKy"
      }
    }
  ]
}